static evaluation

Beyond static AI evaluations: advancing human interaction evaluations for LLM harms and risks

Ibrahim, Lujain, Huang, Saffron, Ahmad, Lama, Anderljung, Markus

arXiv.org Artificial Intelligence

Model evaluations are central to understanding the safety, risks, and societal impacts of AI systems. While most real-world AI applications involve human-AI interaction, most current evaluations (e.g., common benchmarks) of AI models do not. Instead, they incorporate human factors in limited ways, assessing the safety of models in isolation, thereby falling short of capturing the complexity of human-model interactions. In this paper, we discuss and operationalize a definition of an emerging category of evaluations -- "human interaction evaluations" (HIEs) -- which focus on the assessment of human-model interactions or the process and the outcomes of humans using models. First, we argue that HIEs can be used to increase the validity of safety evaluations, assess direct human impact and interaction-specific harms, and guide future assessments of models' societal impact. Second, we propose a safety-focused HIE design framework -- containing a human-LLM interaction taxonomy -- with three stages: (1) identifying the risk or harm area, (2) characterizing the use context, and (3) choosing the evaluation parameters. Third, we apply our framework to two potential evaluations for overreliance and persuasion risks. Finally, we conclude with tangible recommendations for addressing concerns over costs, replicability, and unrepresentativeness of HIEs.
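The three-stage design framework described above can be recorded as a simple data structure. This is a hedged illustration only: the class and field names are my own paraphrase of the paper's stages (risk area, use context, evaluation parameters), not a schema the authors define.

```python
from dataclasses import dataclass, field

@dataclass
class HIEDesign:
    """Minimal record of the three HIE design stages; field names
    and example values are illustrative, not the paper's schema."""
    risk_area: str                                        # stage 1: risk or harm area
    use_context: dict = field(default_factory=dict)       # stage 2: use context
    eval_parameters: dict = field(default_factory=dict)   # stage 3: evaluation parameters

# A hypothetical overreliance evaluation, sketched for illustration.
overreliance_eval = HIEDesign(
    risk_area="overreliance",
    use_context={"task": "advice-seeking", "interaction": "multi-turn chat"},
    eval_parameters={"measures": ["reliance rate", "task accuracy"]},
)
```

Keeping the three stages as separate fields makes it explicit which design decisions belong to which stage when comparing evaluations across risk areas.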


Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants' API Invocation Capabilities

Mu, Honglin, Xu, Yang, Feng, Yunlong, Han, Xiaofeng, Li, Yitong, Hou, Yutai, Che, Wanxiang

arXiv.org Artificial Intelligence

With the rise of Large Language Models (LLMs), AI assistants' ability to utilize tools, especially through API calls, has advanced notably. This progress has necessitated more accurate evaluation methods. Many existing studies adopt static evaluation, assessing AI assistants' API calls against pre-defined dialogue histories. However, such an evaluation method can be misleading, as an AI assistant may fail to generate API calls from the preceding human interaction in real cases. Instead of the resource-intensive method of direct human-machine interaction, we propose Automated Dynamic Evaluation (AutoDE) to assess an assistant's API call capability without human involvement. In our framework, we endeavor to closely mirror genuine human conversation patterns in human-machine interactions, using an LLM-based user agent equipped with a user script to ensure human alignment. Experimental results highlight that AutoDE uncovers errors overlooked by static evaluations, aligning more closely with human assessment. Testing four AI assistants on our crafted benchmark, our method mirrored human evaluation more faithfully than conventional static evaluation.
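The dynamic-evaluation idea can be sketched as a loop in which a scripted user agent drives the dialogue instead of a fixed history. This is a hypothetical sketch in the spirit of the abstract, not the paper's actual interface: `user_agent`, `assistant`, and the dict-shaped replies are all assumptions standing in for LLM calls.

```python
def dynamic_eval(user_agent, assistant, user_script, expected_call, max_turns=6):
    """Return True if the assistant emits the expected API call within
    `max_turns` of dynamically generated dialogue.

    `user_agent(script, history)` produces the next human-like utterance,
    constrained by the scenario script; `assistant(history)` returns a
    dict that may contain an "api_call" entry.  Both are hypothetical
    stand-ins for LLM-backed components.
    """
    history = []
    for _ in range(max_turns):
        utterance = user_agent(user_script, history)
        history.append(("user", utterance))
        reply = assistant(history)
        history.append(("assistant", reply))
        if reply.get("api_call") == expected_call:
            return True
    return False
```

The key contrast with static evaluation is that each user turn depends on the assistant's previous reply, so failures to elicit or interpret intent mid-dialogue are exercised rather than assumed away.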


Towards a Human-like Open-Domain Chatbot

Adiwardana, Daniel, Luong, Minh-Thang, So, David R., Hall, Jamie, Fiedel, Noah, Thoppilan, Romal, Yang, Zi, Kulshreshtha, Apoorv, Nemade, Gaurav, Lu, Yifeng, Le, Quoc V.

arXiv.org Machine Learning

We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. We also propose a human evaluation metric called Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation. Our experiments show strong correlation between perplexity and SSA. The fact that the best perplexity end-to-end trained Meena scores high on SSA (72% on multi-turn evaluation) suggests that a human-level SSA of 86% is potentially within reach if we can better optimize perplexity. Additionally, the full version of Meena (with a filtering mechanism and tuned decoding) scores 79% SSA, 23% higher in absolute SSA than the existing chatbots we evaluated.
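As described above, SSA averages two per-response human judgments. A minimal sketch, assuming each response carries binary sensibleness and specificity labels; note the paper additionally conventions that a non-sensible response is also labeled non-specific, which this sketch leaves to the labelers.

```python
def ssa(labels):
    """Sensibleness and Specificity Average (SSA).

    `labels` is a list of (sensible, specific) pairs, one per model
    response, each element a 0/1 human judgment.  SSA is the mean of
    the sensibleness rate and the specificity rate.
    """
    if not labels:
        raise ValueError("need at least one labeled response")
    sensible = sum(s for s, _ in labels) / len(labels)
    specific = sum(p for _, p in labels) / len(labels)
    return (sensible + specific) / 2

# Three responses: all sensible, two specific -> (1.0 + 2/3) / 2
score = ssa([(1, 1), (1, 1), (1, 0)])
```
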


Learning and Using Hand Abstraction Values for Parameterized Poker Squares

Neller, Todd W. (Gettysburg College) | Messinger, Colin M. (Gettysburg College) | Yang, Zuozhi (Gettysburg College)

AAAI Conferences

We describe the experimental development of an AI player that adapts to different point systems for Parameterized Poker Squares. After introducing the game and research competition challenge, we describe our static board evaluation utilizing learned evaluations of abstract partial Poker hands. Next, we evaluate various time management strategies and search algorithms. Finally, we show experimentally which of our design decisions most significantly accounted for observed performance.
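A static board evaluation of the kind described above can be sketched as summing learned values for each of the ten hands (five rows plus five columns) of a partially filled 5x5 grid. This is illustrative only: the hand abstraction used here (a sorted multiset of ranks) and the value table are assumptions, not the authors' learned abstraction.

```python
def static_eval(grid, hand_value):
    """Score a partial Poker Squares position.

    `grid` is a 5x5 list of rows; each cell is a (rank, suit) tuple or
    None for an empty cell.  `hand_value` maps an abstract partial hand
    (here, a sorted tuple of ranks -- an illustrative abstraction) to a
    learned score; unseen abstractions score 0.
    """
    total = 0.0
    # The ten hands: five rows plus five columns.
    lines = list(grid) + [list(col) for col in zip(*grid)]
    for line in lines:
        cards = [c for c in line if c is not None]
        key = tuple(sorted(r for r, _ in cards))
        total += hand_value.get(key, 0.0)
    return total
```

Because the value table is keyed on abstractions rather than exact hands, the same evaluator can be retrained for each point system simply by relearning the table.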


CHESS-PLAYING PROGRAMS AND THE PROBLEM OF COMPLEXITY

AI Classics

Man can solve problems without knowing how he solves them. We shall try to assess recent progress in understanding and mechanizing man's intellectual attainments by considering a single line of attack. Chess is the intellectual game par excellence. Such characteristics mark chess as a natural arena for attempts at mechanization: if one could devise a successful chess machine, one would seem to have penetrated to the core of human intellectual endeavor. The history of chess programs is an example of the attempt to conceive and cope with complex mechanisms. We return to the original orientation: humans play chess, and when they do they engage in behavior that seems extremely complex, intricate, and successful. Consider, for example, a scrap of a player's (White's) running comment as he analyzes the position in Figure 1: "Are there any other threats? Knight to Bishop 5, threatening the Queen, and also putting more pressure on the King's side, because his Queen's Bishop can come over after he moves his Knight." Notice that his analysis is qualitative and functional. He wanders from one feature to another, accumulating various bits of information that will be available from time to time throughout the rest of the analysis. They need not play in exactly the same way; close simulation of the human is not the immediate issue. Complexity of response is dictated by the task, not by idiosyncrasies of the human response mechanism. There is a close and reciprocal relation between complexity and communication. On the one hand, the complexity of the systems we can specify depends on the language in which we must specify them. Being human, we have only limited capacities for processing information.


cowl '

AI Classics

In this article, a number of concepts that are of importance in research on game-playing programs have been … References: J. H. Conway, On Numbers and Games, Academic Press, New York, 1976; J. H. Conway, "All Games Bright and Beautiful," Am. …


A Dynamical Systems Approach for Static Evaluation in Go

Wolf, Thomas

arXiv.org Artificial Intelligence

In this paper, arguments are given for why the concept of static evaluation has the potential to be a useful extension to Monte Carlo tree search. A new concept of modeling static evaluation through a dynamical system is introduced, and its strengths and weaknesses are discussed. The general suitability of this approach is demonstrated. The concept of Monte Carlo simulations applied to Go [1], combined with the UCT algorithm [2], [3], is a tree search method based on Upper Confidence Bounds (UCB). The detailed tournament report [8] of the program MoGo playing against professional and amateur players reveals strengths and weaknesses of MoGo which are typical of programs that perform Monte Carlo tree search (MCTS). Programs performing MCTS can utilize ever-increasing computing power, but in their pure form, without extra Go knowledge, the ratio log(increase in needed computing power) / (increase in strength) is too large to reach professional strength on large boards in the foreseeable future. Therefore, in recent years, Go knowledge has been incorporated either in the form of heuristics or as pattern databases learned from professional games or from self-play. Although tree search was naturally slowed down, playing strength increased further. With all of this tremendous progress of MCTS compared to the knowledge-based era of computer Go summarized in [9], [10], [11], good reasons are needed to start work on a static evaluation function (SE) in Go. One indicator that more Go knowledge needs to be added is that, compared with human playing strength, the playing level of current programs decreases as board size increases from 9×9 to 13×13 and then to 19×19. The principal difficulties of deriving knowledge and applying it become more relevant as knowledge is increasingly used in MCTS.
Knowledge that is not 100% accurate reduces the scalability of the program: when enough computing power is available, global search increasingly replaces the approximate Go knowledge, which then becomes less useful, or even less accurate, than knowledge coming from search. It is difficult to combine knowledge at a high level if it comes from different sources, such as patterns and local searches. One reason for the originally surprising success of pure MCTS is that it uses knowledge from only one source (statistics of simulations), without the need to merge different types of knowledge.
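For concreteness, the UCB rule that UCT applies at each node of the search tree can be written down directly. This is the standard UCB1 formula, not anything specific to this paper: a child's mean value plus an exploration bonus that shrinks as the child is visited more often.

```python
import math

def ucb1(child_value, child_visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score used by UCT to select a child during tree descent.

    `child_value` is the child's mean simulation result, `child_visits`
    and `parent_visits` are visit counts, and `c` trades exploration
    against exploitation.
    """
    if child_visits == 0:
        return float("inf")  # unvisited children are tried first
    return child_value + c * math.sqrt(math.log(parent_visits) / child_visits)
```

A static evaluation function of the kind the paper argues for would supplement exactly the `child_value` term, injecting Go knowledge into positions the simulations have sampled only sparsely.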


Minimaxing: Theory and Practice

Kaindl, Hermann

AI Magazine

Empirical evidence suggests that searching deeper in game trees using the minimax propagation rule usually improves the quality of decisions significantly. However, despite many recent theoretical analyses of the effects of minimax look-ahead, this phenomenon has still not been convincingly explained. Instead, much attention has been given to so-called pathological behavior, which occurs under certain assumptions. This article supports the view that pathology is a direct result of these underlying theoretical assumptions: pathology does not occur in practice because these assumptions do not apply in realistic domains. The article presents several arguments in favor of minimaxing and focuses attention on the gap between the assumptions' analytical formulation and their practical meaning. A new model is presented, based on the strict separation of static and dynamic aspects in practical programs. Finally, certain methods of improving minimax look-ahead are discussed, drawing on insights gained from this research.